[algo] fix: remove torch.quantile-based percentile metrics to resolve tensor size limit error #3810
Conversation
Remove the p25, p50, p75, p95, and p99 percentile metrics from rollout importance sampling. These metrics used torch.quantile(), which can be computationally expensive. The remaining distribution metrics (mean, std, min, max, eff_sample_size) provide sufficient monitoring coverage.

Changes:
- Remove the quantile computation from compute_is_metrics()
- Update test expectations to remove the percentile metrics (see the sketch below)
- Remove the percentile metrics from documentation and examples
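As a rough illustration of the updated test expectation, here is a minimal sketch; the function name and the exact set of retained keys are assumptions for this example, not necessarily what the verl test suite uses:

```python
# Hypothetical sketch of the updated expectation; actual test names in verl may differ.
def check_is_metrics_keys(metrics: dict) -> None:
    removed = {"rollout_is_p25", "rollout_is_p50", "rollout_is_p75"}
    kept = {"rollout_is_mean", "rollout_is_std", "rollout_is_min",
            "rollout_is_max", "rollout_is_eff_sample_size"}
    assert removed.isdisjoint(metrics), "percentile metrics should no longer be reported"
    assert kept.issubset(metrics), "remaining distribution metrics should still be reported"
```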
Code Review
This pull request effectively resolves the torch.quantile tensor size limit error by removing the percentile-based metrics. The core code change in mismatch_helper.py is correct, and the associated updates to tests and most of the documentation are consistent with this removal. I've identified one high-severity issue in the documentation that needs to be addressed to ensure the provided examples are runnable.
…uide Remove references to torch.quantile-based percentile metrics (p25, p50, p75, p95, p99) from the plotting function and metrics history example to align with the codebase changes that removed these metrics.
- Update all references to use rollout_is naming consistently
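For the documentation side, a minimal sketch of a plotting example that uses only the surviving metrics is shown below; the metrics-history structure and the helper name are assumptions for illustration, not the guide's actual code:

```python
import matplotlib.pyplot as plt

def plot_rollout_is_history(history):
    """Plot the remaining rollout IS metrics over training steps.

    `history` is assumed to be a list of per-step dicts containing the
    rollout_is_* metrics kept by this PR (no percentile keys).
    """
    steps = range(len(history))
    for key in ("rollout_is_mean", "rollout_is_std", "rollout_is_eff_sample_size"):
        plt.plot(steps, [step_metrics[key] for step_metrics in history], label=key)
    plt.xlabel("training step")
    plt.ylabel("metric value")
    plt.legend()
    plt.show()
```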
Summary
Fixes #3787 by removing `torch.quantile()`-based percentile metrics (`rollout_is_p25`, `rollout_is_p50`, `rollout_is_p75`) that caused `RuntimeError: quantile() input tensor is too large` when using large batch sizes or response lengths.

Problem
When using configurations with large tensor sizes (e.g., `max_response_length: 32k`, `rollout.n: 16`, `train_batch_size: 16`), `torch.quantile()` fails with a runtime error due to PyTorch's internal tensor size limitations (roughly 2^24 to 2^27 elements, depending on version, GPU memory, and dtype). The error occurred in `verl/trainer/ppo/mismatch_helper.py`:

```python
metrics["rollout_is_p25"] = torch.quantile(flat_weights, 0.25)
metrics["rollout_is_p50"] = torch.quantile(flat_weights, 0.50)
metrics["rollout_is_p75"] = torch.quantile(flat_weights, 0.75)
```

Solution
Removed the three quantile-based percentile metrics from the Rollout IS framework. The remaining metrics (`rollout_is_mean`, `rollout_is_std`, `rollout_is_min`, `rollout_is_max`, `rollout_is_eff_sample_size`, etc.) provide sufficient monitoring of importance sampling health without triggering tensor size limits (see the sketch after this description).

Changes
- Modified `verl/trainer/ppo/mismatch_helper.py`:
  - Removed the `rollout_is_p25`, `rollout_is_p50`, `rollout_is_p75` metric calculations
  - All other rollout IS and mismatch metrics remain functional

Testing
Verified that:
- The rollout IS framework continues to function correctly without the percentile metrics
- No runtime errors occur with large tensor configurations
- All other metrics (mean, std, min, max, ESS, veto fraction, etc.) are computed correctly

Resolves #3787
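For illustration, here is a minimal sketch of how the surviving distribution metrics can be computed with plain reductions that do not hit the quantile size limit; the function name and the exact effective-sample-size formula are assumptions for this example and may differ from the implementation in `mismatch_helper.py`:

```python
import torch

def compute_is_distribution_metrics(flat_weights: torch.Tensor) -> dict:
    # Plain reductions (mean/std/min/max/sum) scale to large tensors,
    # unlike torch.quantile, which has an input-size limit.
    ess = flat_weights.sum() ** 2 / (flat_weights.pow(2).sum() + 1e-8)
    return {
        "rollout_is_mean": flat_weights.mean(),
        "rollout_is_std": flat_weights.std(),
        "rollout_is_min": flat_weights.min(),
        "rollout_is_max": flat_weights.max(),
        # Effective sample size: (sum w)^2 / sum(w^2), a common IS health indicator.
        "rollout_is_eff_sample_size": ess,
    }
```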